Skip to content

Make IFBench eval tolerate missing or whitespace-shifted responses#27

Open
resolvicomai wants to merge 1 commit into
allenai:mainfrom
resolvicomai:codex/robust-eval-prompt-matching
Open

Make IFBench eval tolerate missing or whitespace-shifted responses#27
resolvicomai wants to merge 1 commit into
allenai:mainfrom
resolvicomai:codex/robust-eval-prompt-matching

Conversation

@resolvicomai

@resolvicomai resolvicomai commented May 20, 2026

Copy link
Copy Markdown

Summary

  • make prompt-to-response lookup tolerant of leading/trailing prompt whitespace
  • score missing model responses as failed examples instead of raising KeyError
  • create the eval output directory automatically
  • add regression coverage for whitespace drift and missing responses

Why

The current sample evaluation can fail before scoring because data/sample_output.jsonl has prompts with trailing whitespace and fewer rows than data/IFBench_test.jsonl. For eval and RLVR-style reward loops, a missing generation should produce a failed score, not abort the whole run.

Context: Prime Intellect IF-RLVR/Bench Algora bounty: https://algora.io/PrimeIntellect-ai/bounties/dderbjHtPwTiGVY4

Validation

  • uv run pytest -q
  • rm -rf /tmp/ifbench-eval && uv run python -m run_eval --input_data=data/IFBench_test.jsonl --input_response_data=data/sample_output.jsonl --output_dir=/tmp/ifbench-eval

/claim https://algora.io/PrimeIntellect-ai/bounties/dderbjHtPwTiGVY4

@resolvicomai

Copy link
Copy Markdown
Author

/claim https://algora.io/PrimeIntellect-ai/bounties/dderbjHtPwTiGVY4

Scope note: this PR targets the evaluator-reliability path for the IF-RLVR/Bench bounty. It is independent from trainer/reward-helper integrations such as #28: it makes the existing eval path runnable on bundled sample outputs, normalizes prompt whitespace, and treats missing generations as failed rows instead of aborting the run.

@resolvicomai

Copy link
Copy Markdown
Author

Friendly bounty-review ping: this PR targets the evaluator-reliability path for the IF-RLVR/Bench bounty. It keeps the scope small by making the existing eval path complete on bundled sample outputs, normalizing prompt whitespace, and scoring missing generations as failed rows instead of aborting. Happy to adjust if the sponsor wants a different slice for the bounty.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant